Abstract

Scientific data services are becoming an important part of the NASA
Center for Climate Simulation’s mission. Our technological response to
this expanding role is built around the concept of a Virtual Climate
Data Server (vCDS), repetitive provisioning, image-based deployment and
distribution, and virtualization-as-a-service. The vCDS is an
iRODS-based data server specialized to the needs of a particular
data-centric application. We use RPM scripts to build vCDS images in
our local computing environment, our local Virtual Machine Environment,
NASA’s Nebula Cloud Services, and Amazon’s Elastic Compute Cloud. Once
provisioned into one or more of these virtualized resource classes,
vCDSs can use iRODS’s federation capabilities to create an integrated
ecosystem of managed collections that is scalable and adaptable to
changing resource requirements. This approach enables platform- or
software-as-a-service deployment of vCDS and allows the NCCS to offer
virtualization-as-a-service: a capacity to respond in an agile way to
new customer requests for data services.

Index Keyword Terms — iRODS, climate data serv- 
ices, cloud computing, software appliance, virtualization 

1. Introduction 

The NASA Center for Climate Simulation (NCCS) 
provides large-scale compute engines, analytics, data 
sharing, long-term storage, networking, and other high- 
end computing services designed to meet the specialized 
needs of the Earth science communities. By doing so, 
NCCS brings NASA observational and model data 
products to climate research carried out by a wide range 
of national and international organizations [1]. 


Last year, we examined the potential of iRODS, the 
Integrated Rule-Oriented Data System, as a means of 
integrating, archiving, and delivering scientific data to 
the communities we serve. We built a testbed collection 
of independent iRODS data systems comprising obser- 
vational and simulation data and used the testbed to 
learn about iRODS and understand how the technology 
might further our mission. We came away from that ex- 
ercise believing that iRODS could provide a useful plat- 
form upon which to build a collection of scientific data 
services tailored to the needs of our customers [2]. 

This year, we have worked to build an operational 
iRODS capability for the NCCS. The result is a product, 
architecture, and approach we refer to as the Virtual 
Climate Data Server (vCDS), a software appliance spe- 
cialized to the needs of climate data collections man- 
agement. In the following sections, we describe our ex- 
periences with vCDS, including motivation and ration- 
ale for the approach, implementation details, and future 
plans regarding scientific data services in the NCCS. 

2. Background 

Data services and data publication are becoming increasingly important
aspects of NCCS’s mission. For example, our two major customers, NASA’s
Global Modeling and Assimilation Office (GMAO) and the Goddard
Institute for Space Studies (GISS), are contributing products to the
Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment
Report (AR5) [3]. Data products for the IPCC AR5 assessment need to be
published to the broader community through the Earth System Grid
(ESG) [4]. GMAO computes the Modern-Era Retrospective Analysis for
Research and Applications (MERRA) data set in the NCCS, which we in
turn convey (publish) to the Goddard Earth Sciences Data and
Information Services Center (GES DISC). The NCCS will also be computing
the Level 4 root-zone soil moisture product for NASA’s Soil Moisture
Active Passive (SMAP) mission when it launches in 2014.

As suggested by these examples, the diversity of our customer base and
the diversity of the data itself are both increasing. Our customers now
include individual scientists, labs, research projects, flight
missions, and even non-traditional, private-sector consumers of climate
simulation products such as the insurance/reinsurance industry. The
datasets involved may be products generated by a General Circulation
Model (GCM), observational data, reanalysis data, or specialized
derived products requiring the combination of simulation and
observational data. Depending on the circumstances, management of the
data may require short-term storage, long-term archival preservation,
or some type of hierarchical staging to accommodate interactive
visualization or use by an application.

3. vCDS Concept and Rationale 

Our notion of a Virtual Climate Data Server has 
grown out of a simple use case that we developed to 
capture the essence of this new data challenge. 

A customer approaches the NCCS with a new dataset they want us to
manage. What technology is needed to quickly meet that customer’s
requirement under the following constraints? The solution should:

• be simple, fast, and affordable;

• provide core capabilities to get started, yet be extensible to
accommodate future needs;

• be flexible, with the ability to use, optimize, and change
deployment configurations in response to resource availability;

• allow the new dataset to be integrated into an existing data
collection; and

• come with a help desk and user support.

The answer we returned to repeatedly is that we need a data server
software appliance specialized to the needs of a managed collection of
climate-related scientific data. Furthermore, in order to be agile and
responsive, this appliance needs to be able to use the flexible
resource allocation capabilities afforded by cloud computing. To allow
for ease of collections integration and to support the full information
lifecycle requirements of a scientific archive, it should be built
around the iRODS technology. Hence our notion of an iRODS-based Virtual
Climate Data Server whose core functionality and suite of ancillary
utilities would support our expanding climate data services mission.


4. vCDS Architecture 

The basic configuration of an iRODS data server consists of a specific
version of iRODS installed on a particular operating system running on
particular hardware. Moving toward a virtual appliance model has been a
two-step process in which we (1) encapsulate the operating system and
iRODS as a virtual machine image, then (2) specialize that image with
functionality required for managing climate data. Our approach to
specialization has been to build general-purpose scientific “kits”
(such as NetCDF, HDF, and GeoTIFF) that sit in the vertical stack above
iRODS and below application-specific climate kits such as IPCC, MERRA,
and SMAP.

Our initial focus has been building a vCDS to manage IPCC AR5 NetCDF
data. Details about these components are provided below, but in
summary, the core elements include the following:

• Application-specific microservices: Canonical archive operations,
particularly the mechanisms required to ingest Open Archival
Information System (OAIS)-compliant Submission Information Package
(SIP) metadata for IPCC NetCDF objects.

• Application-specific metadata: OAIS-compliant constitutive
(application-independent) Representation Information (RI) and
Preservation Description Information (PDI) metadata for IPCC NetCDF
objects.

• Application-specific rules: IPCC NetCDF triggers and workflows.

• A specific release of iRODS: In the current version we have used
iRODS 2.5, augmented with what we refer to as Administrative
Extensions (AE).

• A specific operating system: In our case, SLES 11 SP1.

At the time of this writing we have built vCDS Version 0.9.
Collectively, we refer to the functionality associated with vCDS as the
vCDS V0.9 “product suite.” It includes (1) NetCDF/IPCC Kits,
(2) Administrative Extensions, utilities that enable (3) Repetitive
Provisioning, and mechanisms for (4) Deployment and Distribution.

The technology readiness level (TRL) of V0.9 is approximately 7,
meaning we have completed system prototyping and demonstration in an
operational environment and the system is at or near the scale of an
operational system, with most functions available for demonstration and
test [5]. The software used in the vCDS V0.9 stack is shown in the
table below.


Name               Version       Notes
-----------------  ------------  ------------------------------------------------
iRODS              2.5           Core iRODS installation. Includes i-commands.
Extrods            1.1.0.1-beta  Officially provided iRODS web UI.
PHP                5.2.14        Required for iRODS web UI.
Apache web server  2.2.10        Required to serve iRODS web UI.
FUSE library       2.7.2         Base FUSE library required for iRODS FUSE
                                 interface.
Postgresql         8.3.14        Required RDBMS for iCAT.
UnixODBC           2.2.12        Required for iRODS communication to iCAT.
PyRods             2.5           Community-provided Python wrapper for iRODS
                                 libraries.
EmbedPython        2.5           Community-provided iRODS extension that allows
                                 for Python-based microservice development.
Python             2.6           Core Python environment, needed for PyRods and
                                 EmbedPython.
iRODS-NCCS         0.7           Custom Python-based microservices for data
                                 handling.
iRODS-web          0.7           Java application for viewing iRODS audit
                                 history and usage statistics.
Apache Tomcat      7.0.14        Java Servlet container that serves the
                                 iRODS-web-stats application.
PyGreSQL           4.0           Python postgres driver.
JDBC               2.5.5         Java database connectivity required for NetCDF.
Java Runtime       1.6.0_24      Java runtime, required for iRODS-web-stats.
Ncdump-hdf         4.2.5         Library for interacting with HDF files,
                                 required for iRODS-NCCS.
SLES               11 SP1        Base OS.


4.1. NetCDF/IPCC Kits 

The core functionality of vCDS V0.9 resides in the NetCDF and IPCC
kits that contain the iRODS microservices, rules, configuration
settings, and software utilities required to implement the system’s
canonical operations. These functions include the following:

• Basic system-level operations of an archive: Create, Read, Update,
and Delete (CRUD);

• Rules to identify IPCC NetCDF files; 

• Microservices to manage OAIS Information Ob- 
jects: 

- Submission Information Packages (SIPs) 

- Archive Information Packages (AIPs) 

- Distribution Information Packages (DIPs); 

• The iRODS iCAT along with optimizations to man- 
age and view metadata. 

These capabilities are central to managing the producer/consumer
relationships of an archive. Managing metadata separately from the
storage objects themselves is key to this, since doing so enables
discovery, long-term curation, reuse of the data, and use of the data
for purposes that were not originally anticipated. It is what
distinguishes an archive or managed collection of scientific data from
the typical bit storage functions of a filesystem.

To demonstrate how we have approached managing metadata in vCDS V0.9,
we first describe the internal metadata structure of the NetCDF file
format and how it has been specialized by the IPCC for use in their
data products. Then, we show how we externalize the embedded NetCDF
metadata in a way that makes the vCDS system OAIS compliant.

4.1.1 NetCDF Metadata 

IPCC climate model outputs are stored as NetCDF 
files. NetCDF (Network Common Data Form) is a set of 
software libraries and self-describing, machine- 
independent data formats that support the creation, ac- 
cess, and sharing of array-oriented scientific data. The 
project homepage is hosted by the Unidata program at the University
Corporation for Atmospheric Research (UCAR). Unidata is also the chief
source of NetCDF software, standards development, and updates. The
format is an open, international standard of the Open Geospatial
Consortium [6].

The figure below provides an example of the metadata typically
embedded in an IPCC file. The NetCDF header contains general
information about the accompanying data as well as specific information
required by a NetCDF-aware application to index into the accompanying
records to use the file’s data. The organization of this embedded
header information is stipulated by the research community through the
Coupled Model Intercomparison Project Phase 5 (CMIP5) Data Reference
Syntax (DRS) and Controlled Vocabularies specification [7].




IPCC NetCDF Header Metadata

netcdf cSoil_Lmon_GISS-E2-R_piControl_r1i1p1_398101-400512 {
dimensions:
    time = UNLIMITED ; // (300 currently)
    lat = 90 ;
    lon = 144 ;
    bnds = 2 ;
variables:
    double time(time) ;
        time:bounds = "time_bnds" ;
        time:units = "days since 3981-1-1" ;
        time:calendar = "365_day" ;
        time:axis = "T" ;
        time:long_name = "time" ;
        time:standard_name = "time" ;
    double time_bnds(time, bnds) ;
    double lat(lat) ;
        lat:bounds = "lat_bnds" ;
        lat:units = "degrees_north" ;
        lat:axis = "Y" ;
        lat:long_name = "latitude" ;
        lat:standard_name = "latitude" ;
    double lat_bnds(lat, bnds) ;
    double lon(lon) ;
        lon:bounds = "lon_bnds" ;
        lon:units = "degrees_east" ;
        lon:axis = "X" ;
        lon:long_name = "longitude" ;
        lon:standard_name = "longitude" ;
    double lon_bnds(lon, bnds) ;
    float cSoil(time, lat, lon) ;
        cSoil:standard_name = "soil_carbon_content" ;
        cSoil:long_name = "Carbon Mass in Soil Pool" ;
        cSoil:units = "kg m-2" ;
        cSoil:original_name = "dummy" ;
        cSoil:cell_methods = "time: mean area: mean where land" ;
        cSoil:cell_measures = "area: areacella" ;
        cSoil:history = "2011-03-21T20:30:30Z altered by CMOR: replaced
            missing value flag (-1e+30) with standard missing value
            (1e+20)." ;
        cSoil:missing_value = 1.e+20f ;
        cSoil:_FillValue = 1.e+20f ;
        cSoil:associated_files = "baseURL:
            http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile:
            gridspec_land_fx_GISS-E2R_piControl_r0i0p0.nc areacella:
            areacella_fx_GISS-E2R_piControl_r0i0p0.nc" ;

// global attributes:
        :institution = "NASA/GISS (Goddard Institute for Space Studies)
            New York, NY" ;
        :institute_id = "NASA-GISS" ;
        :experiment_id = "piControl" ;
        :source = "GISS-E2-R-E135F40oQ32 Atmosphere: GISS-E2; Ocean: R" ;
        :model_id = "GISS-E2-R" ;
        :forcing = "N/A" ;
        :parent_experiment_id = "N/A" ;
        :parent_experiment_rip = "N/A" ;
        :branch_time = 0. ;
        :contact = "Kenneth Lo (cdkkl@giss.nasa.gov)" ;
        :references = "www.giss.nasa.gov/research/modeling" ;
        :initialization_method = 1 ;
        :physics_version = 1 ;
        :tracking_id = "3aed12dd-f6e0-44d1-ade9-c064bd355edc" ;
        :product = "output" ;
        :experiment = "pre-industrial control" ;
        :frequency = "mon" ;
        :creation_date = "2011-03-21T20:30:31Z" ;
        :history = "2011-03-21T20:30:31Z CMOR rewrote data to comply
            with CF standards and CMIP5 requirements." ;
        :Conventions = "CF-1.4" ;
        :project_id = "CMIP5" ;
        :table_id = "Table Lmon (31 January 2011)
            a84ae296f75bb85ff61668fac8fcf090" ;
        :title = "GISS-E2-R model output prepared for CMIP5
            pre-industrial control" ;
        :parent_experiment = "N/A" ;
        :modeling_realm = "land" ;
        :realization = 1 ;
        :cmor_version = "2.5.7" ;
}

[Figure panel: NetCDF file layout. The NetCDF header is followed by the
1st through Nth fixed-size variables, then by the 1st record for each of
the 1st through rth record variables, then by the 2nd record for each
record variable, and so on.]
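As an illustration of how the attributes in a header like the one above can be turned into name/value pairs, the following is a minimal sketch in Python. It is not the vCDS microservice code. It assumes the header is available as CDL text (the form that `ncdump -h` prints) with each attribute on a single line, and the function name `parse_cdl_attributes` is hypothetical.

```python
import re

# Matches single-line CDL attribute statements such as
#   time:units = "days since 3981-1-1" ;   (variable attribute)
#   :project_id = "CMIP5" ;                (global attribute, empty variable name)
ATTR_LINE = re.compile(r'^\s*(\w*):(\w+)\s*=\s*(.+?)\s*;\s*$')


def parse_cdl_attributes(cdl_text):
    """Return (variable_attrs, global_attrs) parsed from CDL header text.

    variable_attrs maps variable name -> {attribute name: value}.
    global_attrs maps attribute name -> value.
    Values are kept as strings, with surrounding double quotes removed.
    Attributes whose values span several lines are not handled here.
    """
    variable_attrs, global_attrs = {}, {}
    for line in cdl_text.splitlines():
        match = ATTR_LINE.match(line)
        if not match:
            continue  # dimensions, declarations, comments, braces
        var_name, attr_name, raw_value = match.groups()
        value = raw_value.strip('"')
        if var_name:
            variable_attrs.setdefault(var_name, {})[attr_name] = value
        else:
            global_attrs[attr_name] = value
    return variable_attrs, global_attrs


# A shortened, illustrative header (not a complete IPCC file header).
SAMPLE = '''
netcdf example {
dimensions:
    time = UNLIMITED ; // (300 currently)
variables:
    double time(time) ;
        time:units = "days since 3981-1-1" ;
        time:axis = "T" ;
// global attributes:
        :project_id = "CMIP5" ;
        :realization = 1 ;
}
'''

variables, global_attributes = parse_cdl_attributes(SAMPLE)
```

The split between variable attributes and global attributes mirrors the two groups visible in the header, which is the distinction the later OAIS categorization relies on.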


Facilities such as the NCCS generally externalize a small amount of
this embedded metadata through file and path naming conventions as a
way of organizing their NetCDF collections. The next two figures
demonstrate how this is currently done in the NCCS and how, with an
iRODS-based vCDS, we are able to externalize all of the embedded
metadata, storing it in the iCAT. Doing so makes it possible to search
over these encapsulated attributes without having to open individual
files in the collection. The primary work of the NetCDF/IPCC kits in
vCDS V0.9 is managing the extraction of this core, application- and
use-independent embedded metadata.


[Figure: IPCC NetCDF Header Metadata (the same header as in the
previous figure). A limited amount of IPCC NetCDF header information is
currently externalized through filesystem naming and pathing
conventions ...]

File name: variable _ MIP Table _ Model _ Experiment _ ensemble number _ temporal domain

/portal/GISS/AR5/piControl/E2-R_piControl_r1i1p1
> ls -al
-rw-r--r-- 1 giss admin 20828964 Jul 18 08:05 cSoil_Lmon_GISS-E2-R_piControl_r1i1p1_398101-400511.nc
-rw-r--r-- 1 giss admin 20828964 Jul 18 08:05 cSoil_Lmon_GISS-E2-R_piControl_r1i1p1_398101-400512.nc
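The file naming convention shown in this figure can be read back mechanically. The sketch below is illustrative only: the function and field names are hypothetical, and it assumes a file name that carries all six underscore-separated fields, including the temporal domain.

```python
from collections import namedtuple

# Field names are illustrative labels for the six parts of the file name.
DrsName = namedtuple(
    "DrsName",
    "variable mip_table model experiment ensemble start end",
)


def parse_drs_filename(filename):
    """Split a file name of the form
    variable_MIPtable_model_experiment_ensemble_start-end.nc
    into its fields. Raises ValueError if the shape does not match.
    """
    if not filename.endswith(".nc"):
        raise ValueError("not a NetCDF file name: %r" % filename)
    fields = filename[:-3].split("_")
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    variable, mip_table, model, experiment, ensemble, period = fields
    start, sep, end = period.partition("-")
    if not sep:
        raise ValueError("temporal domain must look like start-end")
    return DrsName(variable, mip_table, model, experiment, ensemble, start, end)


name = parse_drs_filename(
    "cSoil_Lmon_GISS-E2-R_piControl_r1i1p1_398101-400512.nc")
```

Only these few fields are recoverable from the name; everything else in the header stays hidden inside the file, which is the limitation the iCAT-based approach is meant to remove.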



[Figure: IPCC NetCDF Header Metadata (the same header as in the first
figure).]


4.1.2 OAIS Compliance 

An Open Archival Information System (OAIS) is an 
archive consisting of an organization of people and sys- 
tems that has the responsibility to preserve information 
and make it available for a designated community [8]. 
The OAIS reference model addresses a full range of archival
information preservation functions, including:

• ingest, data management, access, and dissemination;

• the migration of digital information to new media and forms;

• the data models used to represent the information, the role of
software in information preservation, and the exchange of digital
information among archives; and

• the identification of both internal and external interfaces to the
archive functions.

The model also identifies a number of high-level services at these
interfaces, provides various illustrative examples and some ‘best
practice’ recommendations, and defines a minimal set of
responsibilities for an archive to be called an OAIS.

One of the goals for the IPCC/NetCDF kits was to enable them to create
collections that are compliant with the OAIS standard. As a starting
point, that means the metadata about objects in the collection needs to
be categorized into the types of metadata recognized by the OAIS
standard. Metadata in an OAIS-compliant collection is organized around
the concept of an Archive Information Package (AIP), which contains the
following classes of metadata:

• Representation Information (RI): Metadata that explains how to
interpret the raw data;

• Preservation Description Information (PDI): Preservation-related
information, such as:

- Provenance: Describes source, custody trail, and history.

- Context: Describes relationships with internal/external data.

- Reference: Describes unique identifiers.

- Fixity: Describes protection from unauthorized alteration;

• Policy Metadata (PM): Parent organization archive administration
information:

- Fixed (from organization business model)

- Negotiable (from data producers and consumers);

• Discovered Metadata (DM): Producer/consumer information that will
foster maximal use of the archive.

The vCDS V0.9 iCAT database was extended to accommodate this OAIS
metadata classification. The database schema was designed to reflect
the uniqueness and repetitiveness of various NetCDF metadata items. For
example, while the global attributes for each file are considered
unique, the dimension, variable, and function definitions are often
common across many files. The metadata idiosyncrasies influencing the
database schema include the following:

• Coordinate variables (input to the model) have an associated
dimension and function as well as a bound that defines the grid size
for the associated dimension. These metadata are not unique to a given
file.

• Output variables (output from the model) have an associated function
that is not unique to a given file.

• All variables are considered to be OAIS Representation Information.

• All global attributes are considered to be OAIS Preservation
Description Information. As such, they additionally have an OAIS
subcategory.
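The rules in this list can be restated as a small lookup table. The following Python sketch is only an illustration of that reading: the dictionary keys, field names, and the `classify` function are hypothetical and are not taken from the vCDS code base.

```python
# Hypothetical rule table distilled from the list above.
# "shared" names the definitions that are common across many files.
ITEM_RULES = {
    "coordinate_variable": {"oais_category": "RI",
                            "shared": ("dimension", "function", "bound")},
    "output_variable": {"oais_category": "RI",
                        "shared": ("function",)},
    "global_attribute": {"oais_category": "PDI",
                         "shared": ()},
}


def classify(item_kind, subcategory=None):
    """Return the OAIS labels for one metadata item.

    Variables are Representation Information (RI) and carry no
    subcategory. Global attributes are Preservation Description
    Information (PDI) and must be given a subcategory by the caller.
    """
    rule = ITEM_RULES[item_kind]
    category = rule["oais_category"]
    if category == "PDI":
        if subcategory is None:
            raise ValueError("PDI items need an OAIS subcategory")
    else:
        subcategory = None  # RI items have no subcategory in this sketch
    return {
        "oais_category": category,
        "oais_subcategory": subcategory,
        "shared": rule["shared"],
    }


lat_labels = classify("coordinate_variable")
title_labels = classify("global_attribute", "CTX")
```

Keeping the "shared" definitions separate from the per-file items is what motivates splitting the metadata across several tables rather than one.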

To accommodate the repetitive nature of some metadata, we implemented
the following five-table design:

• meta_main: This table contains records for each metadata item
considered unique to a given file. It is composed of variable and
global attribute metadata defined as name, value, and unit 3-tuples.
OAIS metadata categories and subcategories, as well as foreign keys to
the other four tables, are also included as required per record type.

• meta_header: This table contains the filename and an index used as a
foreign key by meta_main to link all the records associated with the
file in meta_main back to the filename in meta_header. This table also
includes a foreign key reference to the filename in the iCAT.

• meta_dimension: This table contains only unique entries for
dimension name and dimension size. Many files may have the same set or
subset of dimensions. It also contains an index used as a foreign key
by each coordinate variable record in meta_main to link the coordinate
variable with its associated dimension.

• meta_function: This table contains only unique entries for function
name and function type. Many files may have the same functions. It also
contains an index used as a foreign key by each coordinate and output
variable record in meta_main to link the variable with its associated
function.

• meta_bound: This table contains only unique entries for bound
function name and bound function type. Many files may have the same
bound functions. It also contains an index used as a foreign key by
each coordinate variable record in meta_main to link the coordinate
variable with its associated bound function.

Note that, in the database schema diagram shown 
below, closed dots identify record entries defined as 
‘NOT NULL’ while open dots identify entries not de- 
fined as such. 


[Database schema diagram: • = NOT NULL, o = not defined as NOT NULL,
? = character illegible in the source diagram]

meta_header
  • header_id         serial (10)
  • filename          varchar (?00)
  o icat_id           varchar (10)
  • last_update       timestamp

meta_dimension
  • dimension_id      serial (10)
  • dim_name          varchar (20)
  • dim_value         varchar (?0)
  • last_update       timestamp

meta_bound
  • bound_id          serial (10)
  • bnd_name          varchar (50)
  • bnd_type          varchar (20)
  • last_update       timestamp

meta_function
  • function_id       serial (10)
  • func_name         varchar (50)
  ? func_type         varchar (20)
  • last_update       timestamp

meta_main
  • main_id           serial (10)
  • hdr_id            int4 (10)
  • category          varchar (100)
  • key_name          varchar (100)
  • key_value         varchar (?00)
  o key_unit          varchar (20)
  • oais_category     varchar (20)
  o oais_subcategory  varchar (20)
  o dim_id            int4 (10)
  o func_id           int4 (10)
  o bnd_id            int4 (10)
  • last_update       timestamp

Foreign keys: meta_main_hdr_id_fkey, meta_main_dim_id_fkey,
meta_main_bnd_id_fkey, meta_main_func_id_fkey
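To make the five-table design concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for the PostgreSQL-backed iCAT. Column types, the uniqueness constraints, the NOT NULL choice for `func_type`, and the helper `get_or_create_dimension` are assumptions made for illustration, and the `last_update` columns are left out for brevity.

```python
import sqlite3

# Simplified stand-in schema; SQLite types replace the PostgreSQL ones.
SCHEMA = """
CREATE TABLE meta_header (
    header_id INTEGER PRIMARY KEY,
    filename  TEXT NOT NULL,
    icat_id   TEXT
);
CREATE TABLE meta_dimension (
    dimension_id INTEGER PRIMARY KEY,
    dim_name  TEXT NOT NULL,
    dim_value TEXT NOT NULL,
    UNIQUE (dim_name, dim_value)      -- only unique entries are stored
);
CREATE TABLE meta_function (
    function_id INTEGER PRIMARY KEY,
    func_name TEXT NOT NULL,
    func_type TEXT NOT NULL,          -- assumed NOT NULL in this sketch
    UNIQUE (func_name, func_type)
);
CREATE TABLE meta_bound (
    bound_id INTEGER PRIMARY KEY,
    bnd_name TEXT NOT NULL,
    bnd_type TEXT NOT NULL,
    UNIQUE (bnd_name, bnd_type)
);
CREATE TABLE meta_main (
    main_id   INTEGER PRIMARY KEY,
    hdr_id    INTEGER NOT NULL REFERENCES meta_header (header_id),
    category  TEXT NOT NULL,
    key_name  TEXT NOT NULL,
    key_value TEXT NOT NULL,
    key_unit  TEXT,
    oais_category    TEXT NOT NULL,
    oais_subcategory TEXT,
    dim_id  INTEGER REFERENCES meta_dimension (dimension_id),
    func_id INTEGER REFERENCES meta_function (function_id),
    bnd_id  INTEGER REFERENCES meta_bound (bound_id)
);
"""


def get_or_create_dimension(conn, name, size):
    """Return the id of the (name, size) dimension, inserting it only once."""
    conn.execute(
        "INSERT OR IGNORE INTO meta_dimension (dim_name, dim_value)"
        " VALUES (?, ?)",
        (name, str(size)),
    )
    row = conn.execute(
        "SELECT dimension_id FROM meta_dimension"
        " WHERE dim_name = ? AND dim_value = ?",
        (name, str(size)),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = get_or_create_dimension(conn, "lat", 90)
second = get_or_create_dimension(conn, "lat", 90)  # reuses the existing row
other = get_or_create_dimension(conn, "lon", 144)
dimension_count = conn.execute(
    "SELECT COUNT(*) FROM meta_dimension").fetchone()[0]
```

The same get-or-create pattern would apply to the function and bound tables, so that many files can point at one shared definition through the foreign keys in meta_main.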




In vCDS V0.9, the “Create” 
operation extracts the embedded, 
application-independent metadata 
from IPCC’s NetCDF files and saves 
those values in tables that recognize 
OAIS metadata categories. This, in 
essence, begins construction of an 
OAIS Submission Information Pack- 
age (SIP). At the moment of object 
creation, the NetCDF header infor- 
mation comprises primarily Repre- 
sentation Information (metadata on 
how to interpret the raw data in the 
accompanying file) and Preservation 
Description Information (PDI). The 
intent in future work is to fill out 
other classes of metadata, such as 
Policy Metadata (PM) and Discov- 
ered Metadata (DM) as more detailed 
policies are developed. 




4.2. Administrative Extensions 

A second element in the vCDS product suite is a 
collection of capabilities we refer to as Administrative 
Extensions (AE). These include iRODS Postgres ex- 
tensions and utilities to log system-level object prove- 
nance and provide QA for OAIS metadata compliance 
plus associated Rich Web Browser GUI extensions. 

To view the NetCDF metadata by OAIS metadata categories, predefined
SQL queries were added to the iRODS Rich Web Client.


Displays of RI and PDI are depicted in the following images. First is
a sample coordinate variable query result showing selected metadata
entries from meta_main (Name, Value, Unit, Type, OAIS Category, OAIS
Subcategory), from meta_bound (Bound), from meta_dimension (Dimension),
from meta_function (Function), and from meta_header (iCAT id). Note
that these coordinate variables are categorized as Representation
Information (RI).



Next is a sample global attribute query result showing selected
metadata entries from meta_main (Name, Value, Unit, Type, OAIS
Category, OAIS Subcategory) and from meta_header (iCAT id). Note that
global attributes have no dimension, no function, and no bound. Also
note that these global attributes are categorized as Preservation
Description Information (PDI), each with subcategory Context (CTX).
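A query of the kind described here might look like the following sketch. It is not the predefined SQL shipped with the Administrative Extensions: it uses Python's built-in `sqlite3` module in place of PostgreSQL, a cut-down version of the two tables involved, and made-up sample rows (file name, iCAT id, and category labels are all hypothetical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down stand-ins for meta_header and meta_main.
conn.executescript("""
CREATE TABLE meta_header (
    header_id INTEGER PRIMARY KEY, filename TEXT, icat_id TEXT
);
CREATE TABLE meta_main (
    main_id INTEGER PRIMARY KEY, hdr_id INTEGER, category TEXT,
    key_name TEXT, key_value TEXT, key_unit TEXT,
    oais_category TEXT, oais_subcategory TEXT
);
""")
conn.execute("INSERT INTO meta_header VALUES (1, 'example.nc', '10001')")
conn.executemany(
    "INSERT INTO meta_main (hdr_id, category, key_name, key_value, key_unit,"
    " oais_category, oais_subcategory) VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "global attribute", "project_id", "CMIP5", None, "PDI", "CTX"),
        (1, "global attribute", "frequency", "mon", None, "PDI", "CTX"),
        (1, "coordinate variable", "lat", "90", "degrees_north", "RI", None),
    ],
)

# List one file's metadata for a given OAIS category, with its iCAT id.
CATEGORY_QUERY = """
SELECT m.key_name, m.key_value, m.oais_category, m.oais_subcategory, h.icat_id
FROM meta_main AS m
JOIN meta_header AS h ON h.header_id = m.hdr_id
WHERE h.filename = ? AND m.oais_category = ?
ORDER BY m.key_name
"""

pdi_rows = conn.execute(CATEGORY_QUERY, ("example.nc", "PDI")).fetchall()
ri_rows = conn.execute(CATEGORY_QUERY, ("example.nc", "RI")).fetchall()
```

Because the category is a query parameter, the same statement serves both the RI and the PDI views; a coordinate variable view would additionally join the dimension, function, and bound tables.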


[Screenshot: iRODS Rich Web Client at
rods://rods@localhost:1247/nccsZone/home/rods/portal/GISS/AR5/historical/E2-H_historical_r2i1p1]







4.3. Repetitive Provisioning

We work in a virtualized environment that includes MacBooks running
VMware Fusion, a vSphere dev/test server farm, a NASA cloud computing
environment called Nebula, and Amazon’s Elastic Compute Cloud (EC2). As
one of the major accomplishments of this software release, we developed
RPM Package Manager (RPM) scripts to build software stacks for
SLES 11 SP1, iRODS 2.5 with Administrative Extensions, and vCDS V0.9
virtual images.



4.4. Deployment and Distribution

RPM makes it easy to set up automated build and install procedures
consisting of many packages for an entire operating system. When these
images are provisioned into a virtual cloud resource, capabilities can
be delivered as Infrastructure-as-a-Service (IaaS) (e.g., SLES 11 SP1),
Platform-as-a-Service (PaaS) (iRODS 2.5 AE), and Software-as-a-Service
(SaaS) (vCDS V0.9). Collectively, our ability to provision into these
various resource classes enables virtualization-as-a-service (VaaS),
which is a huge unmet need in our domain.

5. Operational Deployment

vCDS V0.9 is being deployed in the Amazon Elastic Compute Cloud, where
it will be hardened for operational use. These
end-of-system-development activities will essentially elevate vCDS V0.9
to V1.0 at TRL 8/9. Its first application will be to manage a
collection of IPCC AR5 data in EC2 for publication through the Earth
System Grid. Since the ESG gateway requires a filesystem view of the
data it serves, we are using FUSE (File System in User Space) to expose
the vCDS collection to ESG.

[Figure: operational deployment; surviving labels: "(3) Generate
THREDDS Catalog", "Red Hat"]

6. Discussion

Taken together, the elements of this work that we refer to as the vCDS
product suite (the NetCDF/IPCC kits, administrative extensions, and
utilities for automatic provisioning and for deployment and
distribution) enable an approach to scientific collections management
in which virtualization is a driving concept. This approach supports
access to a tiered array of cloud services that are flexible,
adaptable, scalable, and stageable to NCCS “bricks and mortar”
facilities as needed. We can provision capabilities into any resource
class, migrate images from one resource class to another, and use the
iRODS federation mechanism to assemble virtual collections that cross
resource classes. This approach provides an agile entry point into the
NCCS for new customers with data-centric requirements and enables
virtualization-as-a-service.

With the vCDS approach, we are trying to enable the full information
lifecycle management of OAIS-compliant scientific data collections. A
vCDS manages data as a distinguished collection for a person, project,
lab, or other logical unit. A vCDS can manage a collection across
multiple storage resources using rules and microservices to enforce
collection policies. And a vCDS can federate with other vCDSs to manage
multiple collections over multiple resources, thereby creating what can
reasonably be thought of as an ecosystem of managed collections.

7. Conclusions

Up to now, we have been focusing on publishing IPCC AR5 data to the
Earth System Grid. In OAIS parlance, GMAO and GISS are our first data
producers, and ESG (an application) is the first consumer. The first
customer will be the vCDS Collection Administrator, and the IPCC AR5
dataset will be our first vCDS managed collection. Follow-on work will
focus on expanding the array of managed collections and broadening our
community of users, which means expanding vCDS policies, rules, and
microservices.